
Data cleaning
In this section we will handle the missing value of our dataset to make sure that we have a clean dataset to perform our EDA and modeling. We will first visualize the missing values in our dataset and then clean the missing values in the columns that we will use for our analysis. We will also remove some rows and columns that are not relevant for our analysis.
Let’s have a look at the entire dataset and its missing values in grey.
We can see that overall, we do not have many missing values in proportion with the size of our dataset. However, we can see that some columns have a lot of missing values. Below we have the detail of the percentage of missing values by columns.

Dealing with the columns Engine Cylinders and Engine Displacement

As we can see we have missing in 6 columns. Let’s first have a closer look at the engine cylinders and engine displacement columns. They both have 484 missing values. After some data manipulation, we see that these 484 missing are all electric vehicles and that they all have missing values in the engine cylinders and engine displacement columns. Given that in our dataset we have 484 vehicles, we now that theses missing in these column only concerns electric vehicles. This make sense since electric vehicle do not have an combustion engine and therefore those categories are not really applicable. We will therefore replace all missing values in this two columns with “none”.
As we can see, we still have some missing in the columns “Fuel Type 2”, “Engine Description”, “Drive” and “Transmission”. Let’s investigate the missing in the column “Drive”.
Dealing with the column Drive, Transmission and Engine Description

We decided to drop the brand with more than 10% of missing values in the “Drive” column. After this operation, we also removed the 8 observations that remained with missing values in the “Transmission” column. We decided to drop the column engine description since it contains missings values for more than a third of our observation.
Final dataset
The dataset is now cleaned and does not contain any missing values. It contains 42240 observations, 18 features and 129 brands. We renamed the columns and stored it in a csv file (data_cleaned.csv). However, for some models, we need to tackle the unbalanced classes in the target variable. For this reason we also created a new csv file for which we drop the make with less than 10 models (data_cleaned_reduced.csv). This dataset contains 42061 observations, 18 features and 66 brands.
Here are the cleaned datasets on which we are working on from now on.